Overview

Open data has been defined as “data and content [that] can be freely used, modified, and shared by anyone for any purpose,” and the number of open data initiatives undertaken by local governments in the United States has increased dramatically over the past decade.

In addition to serving as a key source of data for staff and others internal to local government, these public-facing repositories are also being leveraged by a variety of users ranging from civic hackers to non-profit organizations, businesses, and individual residents.

Open data portals typically use one of three dominant platforms—ArcGIS Open Data, CKAN, and Socrata. Of these, CKAN is open source and free, ArcGIS Open Data is part of ESRI’s suite of programs that many local governments already use for mapping, and Socrata is billed as a “data as a service” platform with a tiered pricing system.

The majority of city open data portals in the U.S. use Socrata, in part because it is “off the shelf” software. As a result, we can access data from a variety of these open data portals through the Socrata platform’s application programming interface, or API. An API is essentially a program running on a remote server that standardizes information and requests in a way that servers can understand and manage efficiently. A standard definition of an API is a set of protocols and tools that allow interaction between two different applications.

Please watch the short video below, then proceed with the exercise.



The working directory is the default location where R stores files (output) and looks for files (input); it is typically only relevant when one is not operating within an R Project environment. This is where your datasets, scripts, etc. are found, and the working directory can be any folder. To check which folder has been designated as the working directory, enter getwd() in the Console or click the Files tab in the lower-right RStudio panel followed by the More button. If we are working in an R script, we can set the working directory programmatically using the setwd function, or we can use the RStudio Files tab as described in the preceding sentence.
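For instance, the short sketch below checks the working directory, temporarily switches it to R's session-specific temporary folder (just a safe stand-in for a real project folder), and then switches back:

```r
old_wd <- getwd()    # remember the current working directory
setwd(tempdir())     # point R at the session's temporary folder
getwd()              # confirm the change took effect
setwd(old_wd)        # restore the original working directory
```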


Note that for pathnames, you may use a forward slash / or double backslashes \\; otherwise, you will get an error message. This is because the backslash is a special character that is used in string matching, and so in order to distinguish its use as a special character from its use as a regular old \ we use two in order to “escape” its interpretation as a special character. More on regular expressions in a few weeks though…
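To illustrate, both spellings below describe the same (hypothetical) Windows folder, and R stores each doubled backslash as a single character:

```r
path_forward <- "C:/Users/yourname/Documents"     # forward slashes
path_escaped <- "C:\\Users\\yourname\\Documents"  # escaped backslashes

nchar(path_escaped)      # each \\ counts as one stored character
cat(path_escaped, "\n")  # cat() prints the stored single backslashes
```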

State of Emergency in San Francisco

On December 17, 2021 the mayor of San Francisco declared a State of Emergency in the city’s Tenderloin district. This step was taken in response to longstanding issues with public drug use, overdose deaths, and crime in the neighborhood that had recently intensified. From a legal perspective, the declaration reduces bureaucratic hurdles that might slow the implementation of the mayor’s Tenderloin Emergency Intervention Plan.



San Francisco is widely viewed as a tech hub and the city has been among the leaders in the U.S. in opening up its data resources. This lab exercise asks you to apply what we have learned to date to explore the underlying issues that led to the State of Emergency declaration.


Taking Inventory of the Repository

In the code chunk below, we will install and load a few packages to help us access data from the City of San Francisco’s open data portal. We will also specify which of the CRAN mirror sites to use when downloading packages programmatically. In this example, I am pointing to the site hosted at Oak Ridge National Laboratory in Tennessee, but any of these sites will work (i.e., they are mirrors with the same content).


Note that inside the curly brackets of the code chunk I have set the message and warning options to FALSE, which means that this kind of output will be displayed in the Console rather than appearing in the output document (in our case, an HTML file). You can also set the code chunk options by clicking the gear icon in the upper right corner of the code chunk.
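For reference, the same suppression can also be applied to every chunk at once by placing a line like the following in a setup chunk near the top of the .Rmd (this uses knitr's standard chunk-options mechanism):

```r
# Suppress messages and warnings in the knitted output for all chunks
knitr::opts_chunk$set(message = FALSE, warning = FALSE)
```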


options(repos=c(CRAN="https://mirrors.nics.utk.edu/cran/")) 
install.packages("RSocrata")
## package 'RSocrata' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\bw6xs\AppData\Local\Temp\RtmpuKFuo1\downloaded_packages
install.packages("janitor")
## package 'janitor' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\bw6xs\AppData\Local\Temp\RtmpuKFuo1\downloaded_packages
library(RSocrata)
library(jsonlite)
library(tidyverse)
library(janitor)


As was mentioned earlier, Socrata is a platform that is used by many local governments to publish open data. The Socrata Open Data API (SODA API) allows us to access these data resources from an R script or notebook in a standardized way. In the video you watched, the API was likened to a waiter or liaison who carries requests and responses back and forth between a client computer (i.e., you) and server computer (i.e., the open data portal). The documentation for the SODA API can be found here and begins with API endpoints, which are essentially Uniform Resource Locators (URLs) that provide access to data. Without getting into too much detail at the moment, we use the Hypertext Transfer Protocol (HTTP) to send requests and receive responses. Most often, we are using the GET method to send queries and retrieve data. For a quick refresher, take another look at this portion of the assigned reading for this week.
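To make the idea concrete, an API endpoint plus its query parameters is just a URL. The sketch below assembles one for the police incidents dataset used later in this lab; $limit is a SODA query parameter, and the actual request is commented out because it requires internet access:

```r
# The JSON endpoint for the SF police incident reports dataset
endpoint <- "https://data.sfgov.org/resource/wg3w-h783.json"

# SODA query parameters are appended after a "?"
query <- paste0(endpoint, "?$limit=5")
query

# Sending the GET request and parsing the response (requires internet):
# jsonlite::fromJSON(query)
```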


In some cases, an R package exists that allows us to interact with an API without having to construct queries that conform to the requirements of the API, and then pass them using a package like httr (this is what the ungraded exercise that we started in class last time asked you to do from your web browser). The RSocrata package that we installed and loaded above does this “dirty work” of formatting the underlying HTTP queries that allow us to interact with open data portals that use the Socrata platform… and there are a LOT of them out there.


Let’s take a closer look…

help(package="RSocrata")


The “Inspect RSocrata Documentation” code chunk simply opens a page with seven functions that are included in this package. Take a look at this page (i.e., in the Help tab in the lower-right corner), then proceed to Exercise 1 below.

Exercise 1


Review the documentation page that appears after running the code chunk above.

  1. Identify the function you would use to list datasets that exist on the City of San Francisco’s open data portal
    • Review the usage, arguments, and examples for that function
  2. Insert a new code chunk below this text (or at the very end of this .Rmd if you prefer)
    • Write and execute a line of code that retrieves a list of all datasets on the portal
    • Hint: You will need this URL https://data.sfgov.org
  3. Save the information returned by the function in 1. to an object (e.g., sf_data) using the assignment operator <-
    • What kind of object is this?
    • How many rows and columns are there?
    • What are the names of those columns?
    • Hint: Write a few sentences answering these questions in a Markdown section underneath your new code chunk

There are more detailed instructions on what to submit and how at the end of this notebook, but essentially you should add code chunks and text chunks (i.e., Markdown sections) to your locally saved copy of this R Notebook that perform the tasks and answer the questions posed above.

Exploring the Inventory


Next, return to the documentation page for the RSocrata package. The read.socrata function can be used to retrieve a dataset from the San Francisco open data portal. In addition to indexing objects like lists or data frames by their positions using bracket notation, we can also refer to variable names, if they exist, using the $ operator. Let’s sort the inventory and peruse the types of information available. The table function provides a count of instances within each category.

sf_contents <- ls.socrata("https://data.sfgov.org/limitTo=datasets")
names(sf_contents)
head(sf_contents)

sort(sf_contents$title)
table(sf_contents$theme)

sf_contents %>% dplyr::select(title) %>% dplyr::arrange(title) 
janitor::tabyl(sf_contents$theme)


How many datasets are there on the San Francisco portal right now? How many are related to Public Safety? Recall that the default behavior is for the output of a code chunk to appear below it, and the tabular results of the four functions that are executed after the sf_contents data frame object is created can be viewed there too. The names function displays the column or attribute names, while the head function displays the first six rows of the sf_contents data frame object. The sort function places the contents of the specified column or attribute in alphabetical order. The last two lines simply demonstrate how you might do the same thing while avoiding base R functions like sort and table.


Tables are nice, but there must be a better way to present this information. Let’s create a barchart of the portal’s contents with the ggplot2 package that we used in a practice exercise last week (we will begin formally working with tidyverse packages next week). Take a look at the arguments for the geom_bar function. The code chunk below also makes use of the ggsave function to export the barchart to a .png file.

ggplot(data = sf_contents) + 
  geom_bar(stat = "count", aes(x = as_factor(theme)), fill = "dodgerblue") + 
  labs(x="", y = "Number of Datasets") + 
  theme(axis.text.x = element_text(size = 8, angle = 90))

ggsave("Datasets By Theme.png", units = "in", width = 16, height = 8)


Now that we have a sense of what types of datasets are available, we can use what we have learned about subsetting to extract only those datasets related to Public Safety from the larger sf_contents data frame object that represents the holdings of the entire data portal. The line of code below creates a new data frame object that includes only the contents of the San Francisco data portal that are tagged as Public Safety. Insert a new code chunk or type it directly into the Console to try it yourself.

pubsaf_contents <- sf_contents[sf_contents["theme"] == "Public Safety", ]

sort(pubsaf_contents$title)

Again, the sort function just places things in alphabetical order. Do any of these datasets look like they might be relevant to the State of Emergency in the Tenderloin?

The code chunk below demonstrates how you might pull down a specific dataset of interest included in the inventory using bracket subsetting and the dplyr package (your choice).

police_calls_brackets <- sf_contents[sf_contents$title == "Police Department Incident Reports: 2018 to Present", ]

str(police_calls_brackets)
## 'data.frame':    1 obs. of  14 variables:
##  $ accessLevel : chr "public"
##  $ landingPage : chr "https://data.sfgov.org/d/wg3w-h783"
##  $ issued      : POSIXct, format: "2021-05-15"
##  $ @type       : chr "dcat:Dataset"
##  $ modified    : POSIXct, format: "2023-02-02"
##  $ keyword     :List of 1
##   ..$ : chr  "crime" "crime reports" "sfpd" "police department"
##  $ contactPoint:'data.frame':    1 obs. of  3 variables:
##   ..$ @type   : chr "vcard:Contact"
##   ..$ fn      : chr "OpenData"
##   ..$ hasEmail: chr "mailto:no-reply@data.sfgov.org"
##  $ publisher   :'data.frame':    1 obs. of  2 variables:
##   ..$ @type: chr "org:Organization"
##   ..$ name : chr "data.sfgov.org"
##  $ identifier  : chr "https://data.sfgov.org/api/views/wg3w-h783"
##  $ description : chr "<strong>A. SUMMARY</strong>\nRead the <u><a href=\"https://datasf.gitbook.io/datasf-dataset-explainers/sfpd-inc"| __truncated__
##  $ title       : chr "Police Department Incident Reports: 2018 to Present"
##  $ distribution:List of 1
##   ..$ :'data.frame': 4 obs. of  5 variables:
##   .. ..$ @type          : chr  "dcat:Distribution" "dcat:Distribution" "dcat:Distribution" "dcat:Distribution"
##   .. ..$ downloadURL    : chr  "https://data.sfgov.org/api/views/wg3w-h783/rows.csv?accessType=DOWNLOAD" "https://data.sfgov.org/api/views/wg3w-h783/rows.rdf?accessType=DOWNLOAD" "https://data.sfgov.org/api/views/wg3w-h783/rows.json?accessType=DOWNLOAD" "https://data.sfgov.org/api/views/wg3w-h783/rows.xml?accessType=DOWNLOAD"
##   .. ..$ mediaType      : chr  "text/csv" "application/rdf+xml" "application/json" "application/xml"
##   .. ..$ describedBy    : chr  NA "https://data.sfgov.org/api/views/wg3w-h783/columns.rdf" "https://data.sfgov.org/api/views/wg3w-h783/columns.json" "https://data.sfgov.org/api/views/wg3w-h783/columns.xml"
##   .. ..$ describedByType: chr  NA "application/rdf+xml" "application/json" "application/xml"
##  $ license     : chr "http://opendatacommons.org/licenses/pddl/1.0/"
##  $ theme       : chr "Public Safety"
##  - attr(*, "@context")= chr "https://project-open-data.cio.gov/v1.1/schema/catalog.jsonld"
##  - attr(*, "@id")= chr "https://data.sfgov.org/data.json"
##  - attr(*, "@type")= chr "dcat:Catalog"
##  - attr(*, "conformsTo")= chr "https://project-open-data.cio.gov/v1.1/schema"
##  - attr(*, "describedBy")= chr "https://project-open-data.cio.gov/v1.1/schema/catalog.json"
class(police_calls_brackets)
## [1] "data.frame"
police_calls_tidyverse <- sf_contents %>% 
  filter(title == "Police Department Incident Reports: 2018 to Present")

str(police_calls_tidyverse)
## 'data.frame':    1 obs. of  14 variables:
##  $ accessLevel : chr "public"
##  $ landingPage : chr "https://data.sfgov.org/d/wg3w-h783"
##  $ issued      : POSIXct, format: "2021-05-15"
##  $ @type       : chr "dcat:Dataset"
##  $ modified    : POSIXct, format: "2023-02-02"
##  $ keyword     :List of 1
##   ..$ : chr  "crime" "crime reports" "sfpd" "police department"
##  $ contactPoint:'data.frame':    1 obs. of  3 variables:
##   ..$ @type   : chr "vcard:Contact"
##   ..$ fn      : chr "OpenData"
##   ..$ hasEmail: chr "mailto:no-reply@data.sfgov.org"
##  $ publisher   :'data.frame':    1 obs. of  2 variables:
##   ..$ @type: chr "org:Organization"
##   ..$ name : chr "data.sfgov.org"
##  $ identifier  : chr "https://data.sfgov.org/api/views/wg3w-h783"
##  $ description : chr "<strong>A. SUMMARY</strong>\nRead the <u><a href=\"https://datasf.gitbook.io/datasf-dataset-explainers/sfpd-inc"| __truncated__
##  $ title       : chr "Police Department Incident Reports: 2018 to Present"
##  $ distribution:List of 1
##   ..$ :'data.frame': 4 obs. of  5 variables:
##   .. ..$ @type          : chr  "dcat:Distribution" "dcat:Distribution" "dcat:Distribution" "dcat:Distribution"
##   .. ..$ downloadURL    : chr  "https://data.sfgov.org/api/views/wg3w-h783/rows.csv?accessType=DOWNLOAD" "https://data.sfgov.org/api/views/wg3w-h783/rows.rdf?accessType=DOWNLOAD" "https://data.sfgov.org/api/views/wg3w-h783/rows.json?accessType=DOWNLOAD" "https://data.sfgov.org/api/views/wg3w-h783/rows.xml?accessType=DOWNLOAD"
##   .. ..$ mediaType      : chr  "text/csv" "application/rdf+xml" "application/json" "application/xml"
##   .. ..$ describedBy    : chr  NA "https://data.sfgov.org/api/views/wg3w-h783/columns.rdf" "https://data.sfgov.org/api/views/wg3w-h783/columns.json" "https://data.sfgov.org/api/views/wg3w-h783/columns.xml"
##   .. ..$ describedByType: chr  NA "application/rdf+xml" "application/json" "application/xml"
##  $ license     : chr "http://opendatacommons.org/licenses/pddl/1.0/"
##  $ theme       : chr "Public Safety"
##  - attr(*, "@context")= chr "https://project-open-data.cio.gov/v1.1/schema/catalog.jsonld"
##  - attr(*, "@id")= chr "https://data.sfgov.org/data.json"
##  - attr(*, "@type")= chr "dcat:Catalog"
##  - attr(*, "conformsTo")= chr "https://project-open-data.cio.gov/v1.1/schema"
##  - attr(*, "describedBy")= chr "https://project-open-data.cio.gov/v1.1/schema/catalog.json"
class(police_calls_tidyverse)
## [1] "data.frame"


We can see that the data frame object from above is fairly complex and consists of lists, vectors, and data frames. Because I spent some time looking at the API documentation, the dataset itself, and the results of the str function above, I know that the JSON endpoint is https://data.sfgov.org/resource/wg3w-h783.json and I can use the fromJSON function to pull this data into my R Notebook directly from the internet.

police_calls_1 <- fromJSON("https://data.sfgov.org/resource/wg3w-h783.json")

police_calls_tb <- as_tibble(police_calls_1)

str(police_calls_tb)
## tibble [1,000 × 33] (S3: tbl_df/tbl/data.frame)
##  $ incident_datetime          : chr [1:1000] "2021-07-25T00:00:00.000" "2022-06-28T23:58:00.000" "2022-03-11T10:30:00.000" "2021-05-15T17:47:00.000" ...
##  $ incident_date              : chr [1:1000] "2021-07-25T00:00:00.000" "2022-06-28T00:00:00.000" "2022-03-11T00:00:00.000" "2021-05-15T00:00:00.000" ...
##  $ incident_time              : chr [1:1000] "00:00" "23:58" "10:30" "17:47" ...
##  $ incident_year              : chr [1:1000] "2021" "2022" "2022" "2021" ...
##  $ incident_day_of_week       : chr [1:1000] "Sunday" "Tuesday" "Friday" "Saturday" ...
##  $ report_datetime            : chr [1:1000] "2021-07-25T13:41:00.000" "2022-06-28T23:58:00.000" "2022-03-11T20:03:00.000" "2021-05-15T17:47:00.000" ...
##  $ row_id                     : chr [1:1000] "105718906372" "116554371012" "113048071000" "103051807043" ...
##  $ incident_id                : chr [1:1000] "1057189" "1165543" "1130480" "1030518" ...
##  $ incident_number            : chr [1:1000] "216105573" "220264913" "226040232" "210183345" ...
##  $ report_type_code           : chr [1:1000] "II" "VS" "II" "VS" ...
##  $ report_type_description    : chr [1:1000] "Coplogic Initial" "Vehicle Supplement" "Coplogic Initial" "Vehicle Supplement" ...
##  $ filed_online               : logi [1:1000] TRUE NA TRUE NA NA TRUE ...
##  $ incident_code              : chr [1:1000] "06372" "71012" "71000" "07043" ...
##  $ incident_category          : chr [1:1000] "Larceny Theft" "Other Offenses" "Lost Property" "Recovered Vehicle" ...
##  $ incident_subcategory       : chr [1:1000] "Larceny Theft - Other" "Other Offenses" "Lost Property" "Recovered Vehicle" ...
##  $ incident_description       : chr [1:1000] "Theft, Other Property, $50-$200" "License Plate, Recovered" "Lost Property" "Vehicle, Recovered, Motorcycle" ...
##  $ resolution                 : chr [1:1000] "Open or Active" "Open or Active" "Open or Active" "Open or Active" ...
##  $ police_district            : chr [1:1000] "Southern" "Out of SF" "Central" "Out of SF" ...
##  $ cad_number                 : chr [1:1000] NA NA NA NA ...
##  $ intersection               : chr [1:1000] NA NA NA NA ...
##  $ cnn                        : chr [1:1000] NA NA NA NA ...
##  $ analysis_neighborhood      : chr [1:1000] NA NA NA NA ...
##  $ supervisor_district        : chr [1:1000] NA NA NA NA ...
##  $ latitude                   : chr [1:1000] NA NA NA NA ...
##  $ longitude                  : chr [1:1000] NA NA NA NA ...
##  $ point                      :'data.frame': 1000 obs. of  2 variables:
##   ..$ type       : chr [1:1000] NA NA NA NA ...
##   ..$ coordinates:List of 1000
##   .. ..$ : NULL
##   .. ..$ : NULL
##   .. ..$ : NULL
##   .. ..$ : NULL
##   .. ..$ : NULL
##   .. ..$ : NULL
##   .. ..$ : NULL
##   .. ..$ : NULL
##   .. ..$ : NULL
##   .. ..$ : NULL
##   .. ..$ : NULL
##   .. ..$ : NULL
##   .. ..$ : NULL
##   .. ..$ : NULL
##   .. ..$ : NULL
##   .. ..$ : NULL
##   .. ..$ : NULL
##   .. ..$ : NULL
##   .. ..$ : NULL
##   .. ..$ : NULL
##   .. ..$ : NULL
##   .. ..$ : NULL
##   .. ..$ : NULL
##   .. ..$ : NULL
##   .. ..$ : NULL
##   .. ..$ : num [1:2] -122.4 37.7
##   .. ..$ : num [1:2] -122.4 37.8
##   .. ..$ : num [1:2] -122.4 37.8
##   .. ..$ : num [1:2] -122.4 37.8
##   .. ..$ : num [1:2] -122.4 37.7
##   .. ..$ : num [1:2] -122.4 37.7
##   .. ..$ : num [1:2] -122.4 37.7
##   .. ..$ : num [1:2] -122.4 37.8
##   .. ..$ : num [1:2] -122.4 37.8
##   .. ..$ : num [1:2] -122.4 37.8
##   .. ..$ : num [1:2] -122.4 37.8
##   .. ..$ : num [1:2] -122.4 37.8
##   .. ..$ : num [1:2] -122.4 37.8
##   .. ..$ : NULL
##   .. ..$ : num [1:2] -122.4 37.7
##   .. ..$ : num [1:2] -122.4 37.7
##   .. ..$ : num [1:2] -122.4 37.7
##   .. ..$ : num [1:2] -122.4 37.8
##   .. ..$ : num [1:2] -122.4 37.7
##   .. ..$ : num [1:2] -122.4 37.7
##   .. ..$ : num [1:2] -122.4 37.8
##   .. ..$ : num [1:2] -122.4 37.8
##   .. ..$ : num [1:2] -122.4 37.7
##   .. ..$ : num [1:2] -122.4 37.8
##   .. ..$ : num [1:2] -122.4 37.7
##   .. ..$ : num [1:2] -122.4 37.8
##   .. ..$ : num [1:2] -122.4 37.8
##   .. ..$ : num [1:2] -122.4 37.7
##   .. ..$ : num [1:2] -122.5 37.8
##   .. ..$ : num [1:2] -122.4 37.7
##   .. ..$ : num [1:2] -122.4 37.8
##   .. ..$ : num [1:2] -122.4 37.8
##   .. ..$ : num [1:2] -122.4 37.8
##   .. ..$ : num [1:2] -122.4 37.8
##   .. ..$ : num [1:2] -122.4 37.7
##   .. ..$ : num [1:2] -122.4 37.8
##   .. ..$ : num [1:2] -122.4 37.8
##   .. ..$ : num [1:2] -122.4 37.8
##   .. ..$ : num [1:2] -122.4 37.8
##   .. ..$ : num [1:2] -122.4 37.8
##   .. ..$ : num [1:2] -122.4 37.8
##   .. ..$ : num [1:2] -122.5 37.7
##   .. ..$ : num [1:2] -122.4 37.8
##   .. ..$ : num [1:2] -122.4 37.8
##   .. ..$ : num [1:2] -122.4 37.8
##   .. ..$ : num [1:2] -122.4 37.8
##   .. ..$ : num [1:2] -122.4 37.8
##   .. ..$ : num [1:2] -122.4 37.8
##   .. ..$ : num [1:2] -122.4 37.8
##   .. ..$ : num [1:2] -122.4 37.8
##   .. ..$ : num [1:2] -122.4 37.8
##   .. ..$ : num [1:2] -122.4 37.8
##   .. ..$ : num [1:2] -122.4 37.8
##   .. ..$ : NULL
##   .. ..$ : num [1:2] -122.4 37.8
##   .. ..$ : num [1:2] -122.4 37.8
##   .. ..$ : num [1:2] -122.4 37.8
##   .. ..$ : num [1:2] -122.4 37.7
##   .. ..$ : num [1:2] -122.5 37.8
##   .. ..$ : num [1:2] -122.4 37.8
##   .. ..$ : NULL
##   .. ..$ : num [1:2] -122.4 37.8
##   .. ..$ : num [1:2] -122.4 37.8
##   .. ..$ : num [1:2] -122.4 37.7
##   .. ..$ : num [1:2] -122.4 37.8
##   .. ..$ : num [1:2] -122.4 37.7
##   .. ..$ : num [1:2] -122.4 37.7
##   .. ..$ : num [1:2] -122.4 37.8
##   .. ..$ : num [1:2] -122.4 37.8
##   .. ..$ : num [1:2] -122.5 37.7
##   .. ..$ : num [1:2] -122.5 37.7
##   .. ..$ : num [1:2] -122.5 37.8
##   .. ..$ : num [1:2] -122.4 37.8
##   .. ..$ : num [1:2] -122.4 37.7
##   .. .. [list output truncated]
##  $ :@computed_region_jwn9_ihcz: chr [1:1000] NA NA NA NA ...
##  $ :@computed_region_26cr_cadq: chr [1:1000] NA NA NA NA ...
##  $ :@computed_region_qgnn_b9vv: chr [1:1000] NA NA NA NA ...
##  $ :@computed_region_nqbw_i6c3: chr [1:1000] NA NA NA NA ...
##  $ :@computed_region_h4ep_8xdi: chr [1:1000] NA NA NA NA ...
##  $ :@computed_region_n4xg_c4py: chr [1:1000] NA NA NA NA ...
##  $ :@computed_region_jg9y_a9du: chr [1:1000] NA NA NA NA ...



We now have 1,000 records (recall that this is the default limit for some APIs) in a tibble format. If we wanted to write our own API query, we could use the $limit parameter to retrieve more than the default 1,000 records, but this is okay for our purposes. Let’s explore the number and timing of police calls to the Tenderloin neighborhood.
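If you ever do need more than the default, the SODA API's $limit and $offset parameters support paging through a dataset. The sketch below only builds the query URLs for three consecutive pages; the actual fetch is left as a comment because it requires internet access:

```r
base_url <- "https://data.sfgov.org/resource/wg3w-h783.json"

# Build URLs for three consecutive pages of 1,000 records each
pages <- lapply(0:2, function(p) {
  paste0(base_url, "?$limit=1000&$offset=", p * 1000)
  # To actually fetch a page: jsonlite::fromJSON(<the URL above>)
})
pages[[2]]  # the URL for records 1,001 through 2,000
```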


Exercise 2


Execute the code chunk above and make sure you understand the code, then proceed with the tasks below:

  1. Insert a new code chunk below this text (or at the very end of this .Rmd if you prefer), then write and execute code that creates a new object that only contains police calls to the Tenderloin:
    • You will want to use the analysis_neighborhood attribute
  2. Write and execute code to create and export to .png format a barchart that shows at least one of the following:
    • Year
    • Day of the week
  3. Choose another neighborhood (e.g., from the output of tabyl(police_calls_tb$analysis_neighborhood)) and compare it with the Tenderloin in terms of police calls.
    • Generate a barchart for this comparison neighborhood
    • Interpret your findings in a Markdown section underneath your newly added code chunk


Take some time to reflect on this portion of the exercise, then proceed.


The Census API

Rather than downloading data from the official U.S. Census Bureau website, we can also access data programmatically from R. The first step is to take a quick look at the API documentation, then request an API key by visiting this site. You should receive a response in a few minutes, but be sure to safeguard your API key because it is tied to each individual user.

In order to get a sense for how API queries work, open a web browser and type the following into the search bar, inserting your own personal Census API key where indicated:

https://api.census.gov/data/2021/acs/acs5?get=B01003_001E&for=zip%20code%20tabulation%20area:94114,94110&key=YOUR_KEY_GOES_HERE

The resulting browser window should contain the results of the query above in JavaScript Object Notation (JSON) format. This is a compact way to store and transmit information over the internet that we will return to later in the semester. For now, we just need to know that the response reports the requested variable (here, total population, B01003_001E) for each of the two ZCTAs.

So, we can send queries to the Census API for the 2021 ACS 5-Year Estimates directly from a web browser, but it is more efficient to do this from R. But first of all, what are the different components of the query we just executed?

This structure is unique to Census APIs and endpoints. If you want to interact with other APIs, you will first have to refer to their documentation and understand how to properly format a query. It is also possible that APIs change over time causing your code to stop working. Luckily, it is usually not onerous to make the necessary tweaks, once you understand how the API has changed.
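As a recap, the query we typed into the browser can be decomposed into a base URL (dataset and year), the variable(s) requested, the geography, and the key. Rebuilding it in R makes each piece explicit:

```r
base_url  <- "https://api.census.gov/data/2021/acs/acs5"       # dataset endpoint
variables <- "get=B01003_001E"                                 # total population
geography <- "for=zip%20code%20tabulation%20area:94114,94110"  # two SF ZCTAs
api_key   <- "key=YOUR_KEY_GOES_HERE"                          # your personal key

full_query <- paste(base_url,
                    paste(variables, geography, api_key, sep = "&"),
                    sep = "?")
full_query
```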

Exercise 3


Still using a web browser, practice what we have learned here by constructing the proper API query in the search bar to:

  1. Retrieve the total population of the same two San Francisco ZCTAs based on the 2014-2018 ACS 5-Year Estimates
  2. Retrieve the unemployment rate for residents of the same two San Francisco ZCTAs based on the 2014-2018 ACS 5-Year Estimates
    • Hint: Take a look at variable "DP03_0005PE" which is located in a different API with a slightly different base URL here
  3. Now change geographies and retrieve the unemployment rate for the city as a whole
    • Hint: Try &for=place:67000&in=state:06


Data Profile API for Unemployment in #2 and #3 Above


Take some time to experiment with one or more of the other variables, then proceed with the exercise.

Using the examples provided by the Census, we can write some code that retrieves a variety of variables for the City of San Francisco as opposed to specific zip code tabulation areas.

# install.packages("httr")
library(httr)
library(jsonlite)

baseurl <- "https://api.census.gov/data/2018/acs/acs5?"
param1 <- "get="
param2 <- "&for=place:"
param3 <- "&in=state:06"
key <- "&key=YOUR_KEY_GOES_HERE"  # substitute your personal Census API key

# These are the place FIPS codes for San Francisco, Oakland, and Berkeley
places_list <- paste("67000", "53000", "06000", sep = ",")
vars_list <- c("B01003_001E", "B19013_001E")
vars_names <- c("Total population", "Median household income")

# Use a simple for loop to download two data elements rather than one
for (i in seq_along(vars_list)) {

  req <- httr::GET(paste0(baseurl, param1, vars_list[i],
                          param2, places_list, param3, key))
  req_json <- fromJSON(content(req, "text"), flatten = TRUE)
  req_tibble <- as_tibble(req_json[2:dim(req_json)[1], ])
  colnames(req_tibble) <- req_json[1, ]

  name <- paste0("these_results_", i)
  assign(name, req_tibble)

}
## Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if
## `.name_repair` is omitted as of tibble 2.0.0.
## ℹ Using compatibility `.name_repair`.
# Combine the resulting data frame objects
merged_tibble <- inner_join(these_results_1, these_results_2, by = c("place", "state"))   
merged_tibble


The code chunk above uses the httr package to manually construct HTTP queries to retrieve desired pieces of information from the Census API. The httr::GET function is analogous to your web browser in that it sends the fully formatted request to the API. The paste0 function concatenates strings (e.g., baseurl, param1, etc.) and the httr::GET function “sends it over”. The jsonlite::fromJSON function helps us to take the very messy HTTP response and easily convert it to an R object (e.g., data frame, tibble, etc.) that we can work with. The assign function is part of base R and allows us to assign names to objects created inside a loop based on the counter i which increments each time we make our way through the code. Finally, the inner_join function from dplyr allows us to combine or join the tibble objects created by the loop into a single tibble object where we use the place and the state attributes as unique identifiers to do the joining.
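As a design note, an alternative to assign is to collect each iteration's result in a named list, which keeps related objects together and makes the final join a one-liner. The sketch below uses dummy data frames in place of real API responses (the place codes come from the chunk above, but the numbers are made up), and base R's merge so it runs without dplyr:

```r
# Dummy stand-ins for the tibbles the loop builds from API responses
population <- data.frame(place = c("67000", "53000"), state = "06",
                         pop = c(870000, 430000))
med_income <- data.frame(place = c("67000", "53000"), state = "06",
                         income = c(112000, 82000))

# Store results in a named list instead of assign()-ing loose objects
results <- list(population = population, med_income = med_income)

# Join every element of the list on the shared identifier columns
merged <- Reduce(function(x, y) merge(x, y, by = c("place", "state")), results)
merged
```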


Work Products

Please submit an R notebook and knitted HTML file that shows your work and responses for each of the Exercises included in this lab exercise (the code chunk directly above is an example and not a formal exercise). Also, briefly comment on your experience with R during this lab exercise. Please upload your report to Collab by 5:00 pm on Thursday February 9th.


Assessment Rubric

This Lab Exercise will be graded on a 100-point scale according to the rubric below:

Length and formatting (10 pts)

Clarity of writing and attention to detail (20 pts)

Technical Content (45 pts)

Reflective Content (25 pts)


© Bev Wilson 2023 | Department of Urban + Environmental Planning | University of Virginia